Data Pre-Processing to Train a Better Lithuanian-English MT System

Authors

  • Daiga Deksne
  • Raivis Skadins
Abstract

[Table residue: segmentation variants of the compared systems; surviving labels: System #2, System #3, System #4]
Example segmentation: Pried -as ir Protokol -as yr -a neatskiriam -a ši -o Susitar -imo dal -is.
  • Prefixes separated, endings replaced by tense and number feature values
  • Prefixes separated, all endings replaced by number feature values and verb endings also by tense feature values
  • Prefixes separated, endings deleted

Lithuanian is a highly inflected language: words change form according to their grammatical function. The endings of nouns, pronouns, adjectives, numerals and verbs therefore change depending on certain features. English, by contrast, has no such rich feature system. This difference between the languages significantly affects word and phrase alignment when training an SMT system. Typically, one or two forms of an English word have to be aligned to more than ten different surface forms of the corresponding Lithuanian word. Lithuanian verbs take prefixes indicating negation and other semantic features, whereas English verbs have no such prefixes and express this information with modifying words. Many word forms are much rarer than others in the corpus, so a Lithuanian-English SMT system does not translate all word forms equally well. Out-of-vocabulary words are very common when translating from Lithuanian into English.

Chosen approach
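The kind of ending separation and ending deletion described above can be illustrated with a minimal sketch. This is not the authors' implementation: the ending list, the function names and the minimum-stem rule are all assumptions made for illustration, and a real system would use a morphological analyser rather than a suffix list. The input sentence is the abstract's segmented example with its stems and endings rejoined.

```python
# Hypothetical, deliberately tiny ending inventory (longest first); a real
# system would rely on a morphological analyser, not a suffix list.
ENDINGS = ("imo", "as", "is", "a", "o")


def split_ending(word):
    """Return (stem, ending); the ending is '' if no listed ending matches."""
    for ending in ENDINGS:
        # Require at least two stem characters so short words stay intact.
        if word.endswith(ending) and len(word) >= len(ending) + 2:
            return word[: -len(ending)], ending
    return word, ""


def preprocess(tokens, mode):
    """mode='separate' keeps endings as '-x' tokens; mode='delete' drops them."""
    out = []
    for token in tokens:
        stem, ending = split_ending(token)
        out.append(stem)
        if ending and mode == "separate":
            out.append("-" + ending)
    return " ".join(out)


tokens = "Priedas ir Protokolas yra neatskiriama šio Susitarimo dalis".split()
print(preprocess(tokens, "separate"))
print(preprocess(tokens, "delete"))
```

With this toy inventory, the first printed line reproduces the segmented example from the abstract, and the second shows the same sentence reduced to stems. Either way, the many surface forms of a word collapse onto one stem token, which is the effect the pre-processing is meant to have on alignment.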


Similar resources

Language Model Data Augmentation for Keyword Spotting in Low-Resourced Training Conditions

This research extends our earlier work on using machine translation (MT) and word-based recurrent neural networks to augment language model training data for keyword search in conversational Cantonese speech. MT-based data augmentation is applied to two language pairs: English-Lithuanian and English-Amharic. Using filtered N-best MT hypotheses for language modeling is found to perform better th...


English-Lithuanian-English Machine Translation lexicon and engine: current state and future work

Gintaras Barisevičius, Bronius Tamulynas (Kaunas University of Technology). This article overviews the current state of the English-Lithuanian-English machine translation system. The first part of the article describes the problems that the system poses today and what actions will be taken to solve them in...


Evaluation Methodology and Results for English-to-Arabic MT

This paper describes the evaluation campaign of the MEDAR project for English-to-Arabic (EnAr) MT systems. The campaign aimed at establishing some basic facts about the state of the art for MT on EnAr, collecting enough data to better train and tune systems and assessing the improvements made. The paper details the data used and their formats, the evaluation methodology and the results obtained...


The MIT-LL/AFRL IWSLT-2010 MT system

This paper describes the MIT-LL/AFRL statistical MT system and the improvements that were developed during the IWSLT 2010 evaluation campaign. As part of these efforts, we experimented with a number of extensions to the standard phrase-based model that improve performance on the Arabic and Turkish to English translation tasks. We also participated in the new French to English BTEC and English t...


Teaching MT Through Pre-editing: Three Case Studies

This article reports on three cases of teaching translation or English as a foreign language using pre-editing tasks with a machine translation system. Trainee translators or English learners were asked to input a Chinese or English paragraph into an MT system, observe the irregularities in the output, and subsequently edit the source text and input it again in the hope of getting better output...




Publication date: 2012